Introduction to Triton Programming: The Parallel Execution Model: Thinking in Blocks

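Moving from serial CPU programming to GPU programming requires a paradigm shift: from element-wise iteration to block-based execution. Instead of viewing data as a stream of scalars, you treat it as a collection of "blocks" that are scheduled to saturate the hardware's bandwidth.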
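The shift from element-wise iteration to block-based execution can be sketched on the CPU in plain Python. This is only an emulation under stated assumptions, not GPU code: the names `add_serial`, `add_blocked`, `pid`, `offsets`, and `mask` are illustrative choices, and the outer loop over blocks runs sequentially here, whereas on a GPU the blocks would be scheduled in parallel.

```python
def add_serial(x, y):
    """Element-wise view: one scalar addition per loop iteration."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = x[i] + y[i]
    return out


def add_blocked(x, y, block_size):
    """Block view: each 'program' handles one contiguous block of indices."""
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceiling division
    for pid in range(num_programs):  # sequential here; parallel on a GPU
        # Indices this block is responsible for.
        offsets = [pid * block_size + k for k in range(block_size)]
        # The mask guards the last block, which may extend past the end.
        mask = [off < n for off in offsets]
        for off, valid in zip(offsets, mask):
            if valid:
                out[off] = x[off] + y[off]
    return out


x = [float(i) for i in range(10)]
y = [2.0 * i for i in range(10)]
assert add_blocked(x, y, block_size=4) == add_serial(x, y)
```

With 10 elements and a block size of 4, the third block covers indices 8 to 11, and the mask keeps it from touching the two indices that fall outside the data.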

1. Memory-Bound vs. Compute-Bound

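A kernel's bottleneck is determined by the ratio of arithmetic operations to memory accesses. Vector addition is typically memory-bound because it performs only one addition for every three memory operations (two loads and one store). As a result, the hardware spends more time waiting for data from DRAM than it spends computing.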
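The ratio described above is often called arithmetic intensity, and a simple roofline estimate makes the argument concrete. The sketch below assumes 4-byte `float32` elements, and the two peak figures are hypothetical values chosen only for illustration, not the specification of any real device.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_gflops(intensity, peak_gflops, peak_bandwidth_gbs):
    """Simple roofline: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, intensity * peak_bandwidth_gbs)


BYTES_PER_FLOAT32 = 4

# Per output element: 2 loads + 1 store of float32, and 1 addition.
vec_add_intensity = arithmetic_intensity(flops=1, bytes_moved=3 * BYTES_PER_FLOAT32)

# Hypothetical device figures, for illustration only.
PEAK_GFLOPS = 10_000.0
PEAK_BANDWIDTH_GBS = 1_000.0

# FLOPs per byte a kernel needs before compute becomes the limit.
ridge_point = PEAK_GFLOPS / PEAK_BANDWIDTH_GBS
is_memory_bound = vec_add_intensity < ridge_point
roof = attainable_gflops(vec_add_intensity, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)

print(f"intensity = {vec_add_intensity:.4f} FLOP/byte, ridge = {ridge_point:.1f}")
print(f"memory-bound: {is_memory_bound}, attainable ~ {roof:.1f} GFLOP/s")
```

Under these assumed numbers, vector addition sits far to the left of the ridge point, so its attainable throughput is set by memory bandwidth rather than by the arithmetic units.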

2. The Role of BLOCK_SIZE

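BLOCK_SIZE defines the granularity of parallelism. If it is too small, the kernel cannot make full use of the GPU's wide execution lanes. A well-chosen size ensures there is enough "work in flight" to saturate the memory bus.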
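A rough way to see the effect of BLOCK_SIZE is to compute how many blocks are needed to cover the data and what fraction of the execution lanes each block keeps busy. The model below is a deliberate simplification: it assumes a block occupies whole groups of 32 lanes, and the helper names `grid_size` and `lane_utilization` are hypothetical, not part of any library.

```python
def grid_size(n_elements, block_size):
    """Number of blocks needed to cover n_elements (ceiling division)."""
    return (n_elements + block_size - 1) // block_size


def lane_utilization(block_size, lanes=32):
    """Fraction of lanes doing useful work, assuming a block occupies
    whole groups of `lanes` lanes (a simplified model)."""
    groups = (block_size + lanes - 1) // lanes
    return block_size / (groups * lanes)


n = 1_000_000
for block_size in (8, 32, 1024):
    print(
        f"BLOCK_SIZE={block_size:5d}  "
        f"blocks={grid_size(n, block_size):7d}  "
        f"lane utilization={lane_utilization(block_size):.2f}"
    )
```

In this model a block of 8 elements leaves three quarters of a 32-lane group idle and also requires far more blocks to be scheduled, which illustrates why very small blocks underuse the hardware.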

3. Hiding Latency Through Occupancy

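Occupancy refers to the number of blocks that are active on the GPU at the same time. It is not the end goal in itself, but it lets the scheduler switch to another block and do useful computation while a block is waiting on a high-latency memory fetch from VRAM.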
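How much work must be in flight to hide latency can be estimated with Little's law (concurrency = throughput x latency). The sketch below is a back-of-the-envelope calculation only: the bandwidth, latency, and per-block load size are hypothetical numbers, and real schedulers and memory systems are considerably more complicated.

```python
import math


def bytes_in_flight_needed(bandwidth_gbs, latency_ns):
    """Little's law: bytes that must be outstanding to keep the bus busy.
    GB/s times ns gives bytes (1e9 B/s * 1e-9 s)."""
    return bandwidth_gbs * latency_ns


def blocks_needed(bandwidth_gbs, latency_ns, bytes_per_block):
    """Blocks that must have loads outstanding to cover the latency."""
    needed = bytes_in_flight_needed(bandwidth_gbs, latency_ns)
    return math.ceil(needed / bytes_per_block)


# Hypothetical figures, for illustration only.
BANDWIDTH_GBS = 1_000.0
LATENCY_NS = 500.0

# Assume each block loads BLOCK_SIZE float32 values from two input vectors.
BLOCK_SIZE = 1024
bytes_per_block = BLOCK_SIZE * 4 * 2

in_flight = bytes_in_flight_needed(BANDWIDTH_GBS, LATENCY_NS)
active_blocks = blocks_needed(BANDWIDTH_GBS, LATENCY_NS, bytes_per_block)
print(f"bytes in flight needed: {in_flight:.0f}")
print(f"active blocks needed:   {active_blocks}")
```

The takeaway is qualitative: the higher the latency relative to the bandwidth, the more blocks need to be active at once so that some always have data ready while others wait.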

4. Hardware Utilization

For peak performance, BLOCK_SIZE should be matched to the memory coalescing rules of the GPU architecture, so that consecutive threads access consecutive memory addresses.
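The benefit of consecutive threads touching consecutive addresses can be illustrated by counting how many aligned memory segments a group of threads touches. This is a simplified model with assumed parameters (32 threads per group, 4-byte elements, 128-byte segments); the function names are hypothetical and actual coalescing rules vary by architecture.

```python
def warp_addresses(base, stride_elems, threads=32, elem_bytes=4):
    """Byte address accessed by each thread for a given element stride."""
    return [base + t * stride_elems * elem_bytes for t in range(threads)]


def segments_touched(addresses, segment_bytes=128):
    """Count the distinct aligned segments covered by the addresses."""
    return len({addr // segment_bytes for addr in addresses})


# Consecutive threads read consecutive elements: stride of 1 element.
coalesced = segments_touched(warp_addresses(base=0, stride_elems=1))

# Each thread jumps 32 elements ahead of its neighbour: stride of 32.
strided = segments_touched(warp_addresses(base=0, stride_elems=32))

print(f"segments touched, contiguous access: {coalesced}")
print(f"segments touched, strided access:    {strided}")
```

In this model the contiguous pattern is served from a single segment, while the strided pattern spreads the same 32 reads across 32 segments, so far more memory traffic is needed to deliver the same amount of useful data.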

QUESTION 1

For a kernel that adds two vectors ($out = x + y$), what is the most likely bottleneck on modern GPUs?

Arithmetic Throughput

Memory Bandwidth

Shared Memory Latency

QUESTION 2

What is the primary purpose of 'Occupancy' in the GPU execution model?

To ensure every thread runs as fast as possible.

To hide memory latency by keeping work in flight.

To increase the clock speed of the compute units.

To reduce the power consumption of the HBM.

QUESTION 3

Which of the following describes 'Memory-Bound' behavior?

The GPU is waiting for the memory bus to deliver data.

The GPU has exhausted its available VRAM.

The kernel is performing too many complex floating-point operations.

The CPU cannot launch kernels fast enough.

QUESTION 4

What happens if the BLOCK_SIZE is set too small?

The kernel will fail with a memory error.

The GPU fails to utilize its wide SIMD execution lanes.

The memory bandwidth increases significantly.

QUESTION 5

In the logistics warehouse analogy, what represents the 'Blocks'?

The individual items.

The workers.

The organized pallets.

The delivery trucks.